Delineation of complex gene expression patterns in single cell RNA-seq data with ICARUS v2.0

Abstract Complex biological traits and disease often involve patterns of gene expression that can be characterised and examined. Here we present ICARUS v2.0, an update to our single cell RNA-seq analysis web server with additional tools to investigate gene networks and understand core patterns of gene regulation in relation to biological traits. ICARUS v2.0 enables gene co-expression analysis with MEGENA, transcription factor regulated network identification with SCENIC, trajectory analysis with Monocle3, and characterisation of cell-cell communication with CellChat. Cell cluster gene expression profiles may be examined against Genome Wide Association Studies with MAGMA to find significant associations with GWAS traits. Additionally, differentially expressed genes may be compared against the Drug-Gene Interaction database (DGIdb 4.0) to facilitate drug discovery. ICARUS v2.0 offers a comprehensive toolbox of the latest single cell RNA-seq analysis methodologies packed into an efficient, user friendly, tutorial style web server application (accessible at https://launch.icarus-scrnaseq.cloud.edu.au/) that enables single cell RNA-seq analysis tailored to the user's dataset.


INTRODUCTION
Many biological networks contain patterns of gene expression that are evolutionary conserved. These patterns of networks are comprised of intertwined signalling pathways that are fundamentally important for biological functionality including development, differentiation, response to stimuli and senescence. During disease, many of these pathways become dysregulated and characterisation of these affected cell populations and their regulatory networks is vital for understanding disease mechanisms and pathogenesis (1)(2)(3). Here, we present an update to ICARUS (4), our web server tool to enable users without experience in R to undertake single cell RNA-seq analysis and aid hypothesis generation. ICARUS v2.0 assembles the latest single cell RNA-seq gene network construction and analysis tools to allow interpretation and deconstruction of complex biological traits.
ICARUS v2.0 supports establishment of gene networks through the implementation of; Multiscale Embedded Gene Co-expression Network Analysis (MEGENA) (5), transcription factor regulatory network construction through single-cell regulatory network inference and clustering (SCENIC) (6), characterisation of gene expression changes through pseudotime (cell trajectories) with Mon-ocle3 (7-9) and cell-cell communication signalling classification through identification of ligand-receptor pairs with CellChat (10). Furthermore, the gene expression profiles of cell populations can be examined for associations with Genome Wide Association Study (GWAS) traits using the multi-marker analysis of genomic annotation (MAGMA) tool (11,12) to explore crucial cell populations that may drive the affected phenotype. Finally, differentially expressed genes can be queried against the drug-gene interactions database, DGIdb 4.0 (13) to facilitate the identification of drug targets for therapeutic testing.
Further updates to data processing methods are also included in this update. These include, a quality control step to identify cell doublets/multiplets using DoubletFinder (14), an improved method of data normalisation, SCTransform (15,16), and a new method for cell cluster labelling, sctype (17) that congregates cell type specific markers from CellMarker (18) and PanglaoDB (19) databases. Summary of R packages used in ICARUS v2.0 including the main command are detailed in Table 1. ICARUS v2.0 is accessible through an efficient, user-friendly web server at https: //launch.icarus-scrnaseq.cloud.edu.au/.

Updates to processing methods
Doublet removal. Cell doublets or multiplets may arise during scRNA-seq library preparation when dropletbased microfluidics methods partition multiple cells into a  (14). DoubletFinder first simulates 'artificial doublets' using transcriptional profiles from pairs of cells in real data. The artificial doublets are merged with real data and dimensionality reduction is performed. Lastly, a k-nearest neighbour graph is developed, and each cell is scored based on its proximity to artificial doublets. The highest scoring cells are assigned as real doublets. The user has the option to visualise and remove these doublets for further analysis.
SCTransform. Scaling and normalization of raw gene count matrices of single cell RNA-seq data is a common practice to account for technical confounders such as differences in sequence depth. ICARUS v2.0 now includes the Seurat SCTransform method of data normalisation which recovers sharper biological distinctions between formed cell clusters compared to log-normalisation. The SCTransform methodology computes Pearson residuals from a negative binomial regression of the raw count data and applies dimensionality reduction methods (15,16).
Sctype and additional SingleR datasets. The expression profiles of marker genes for each cell cluster may be compared against existing cell marker datasets to assign cell type labels to clusters. ICARUS v2.0 introduces cell cluster labelling with sctype (17), an ultra-fast unsupervised method for cell type annotation using compiled cell markers from CellMarker (http://bio-bigdata.hrbmu.edu. cn/CellMarker/) (18) and PanglaoDB (https://panglaodb. se/) (19) databases. Sctype is computationally efficient and includes marker gene sets for 11 different tissue types including brain, pancreas, immune system, liver, eye, kid-ney, lung, embryonic, gastrointestinal tract, muscle and skin (17). ICARUS provides the SingleR supervised cell-type assignment algorithm to annotate cell clusters by comparison to previously annotated cell types in single cell datasets (25). ICARUS v2.0 extends the utility of SingleR by including additional reference datasets including Darmanis human temporal lobe data (26); Zhong human embryonic prefrontal cortex data (27); Baron human and mouse pancreas data (28); Lawlor human non-diabetic and diabetic pancreas data (29); Muraro human pancreas data (30); Segerstolpe human non-diabetic and diabetic pancreas data (31) and the He organ atlas, which comprises data from 15 organs (bladder, blood, common bile duct, oesophagus, heart, liver, lymph node, bone marrow, muscle, rectum, skin, small intestine, spleen, stomach and trachea) (32).

Gene network construction
MEGENA: gene co-expression modules. Patterns of biological networks can be grouped into hierarchal coexpression modules which infer functionality and dictate biological causality. A method of detecting these gene co-expression modules is multiscale embedded gene coexpression network analysis (MEGENA) which employs a planar maximally filtered graph to extract significant gene interactions and constructs co-expression gene modules (5). MEGENA first classifies gene co-expression modules through identification of significant gene interactions via planar filtered network construction and embedding on a topological sphere. Multi-scale clustering is then applied to group similar and dissimilar gene modules. Lastly, multiscale hub analysis is used to connect clusters in a hierarchical structure.
ICARUS v2.0 performs MEGENA on either a set of highly variable genes (Seurat::FindVariableFeatures) or user computed differentially expressed genes to generate a set of hierarchically ordered co-expression modules, where larger modules progressively branch into smaller submod-ules. A heatmap of significant Gene Ontology terms associated with each co-expression module is also provided to aid interpretation and hypothesis generation. Additionally, a hierarchical sunburst plot of computed MEGENA co-expression modules is produced which highlights the cell cluster/cell type that displays the highest activity of the corresponding module. Module activity is computed as a percentage of cluster marker gene overlap (Seurat::FindAllMarkers) with co-expression module genes.
SCENIC: transcription factor regulatory networks. A cell's transcriptional state may be characterised by gene regulatory networks (GRNs) that are formed by transcription factors and cofactors that regulate each other and their downstream gene targets. ICARUS v2.0 utilises the SCENIC R package (6) to characterise cell cluster/cell type specific GRNs using either a set of highly variable genes (Seurat::FindVariableFeatures) or user computed differentially expressed genes. SCENIC performs cis-regulatory transcription factor binding motif analysis on a set of coexpressed transcription factors and variable genes. Transcription factor motifs in the promoter region of genes (up to 500 bp upstream of the transcription start site) and 20kb around the transcription start site (±10 kb) are scored. ICARUS v2.0 currently supports SCENIC v1.2.4 which includes species specific transcription factor binding motif databases for human, mouse, and fly. The activity of these transcription factor regulated gene modules is scored across cell populations using the AUCell algorithm (6). ICARUS v2.0 visualises regulated gene module activity across cell populations in 2D/3D UMAP and t-SNE plots. Heatmap and dotplot visualisations are also provided.
Monocle3: trajectory analysis. The expression profiles of cells change during development, in response to stimuli and throughout life. Trajectory analysis aims to determine the sequence of gene expression changes in single cell RNA-seq datasets. ICARUS v2.0 employs the Monocle3 algorithm (8) to graph cells according to their progress in pseudotime, a measure of the amount of transcriptional change a cell undergoes from its beginning to end states. Calculation of pseudotime requires the user to select a population of cells where the beginning of the biological process is likely located (known as root cells). Root cells may be selected from a specific cell population/cell type or interactively using a lasso select function. A table of genes that exhibit changes across pseudotime is also provided.
CellChat: cell-cell ligand receptor signalling. Cells interact and communicate with each other through ligandreceptor pairs that coordinate many biological processes in both healthy and diseased conditions. The characterisation of cell-cell signalling crosstalk through soluble agonists/antagonists and membrane bound cofactors may help us interpret and understand complex networks of systems biology. ICARUS v2.0 employs the CellChat (10) tool to quantitatively infer and analyse intercellular communication networks from single cell RNA-seq data. CellChat incorporates a manually curated comprehensive database of ligand-receptor pairs, soluble agonists/antagonists and stimulatory/inhibitory membrane bound co-receptors to infer cell-cell communication interactions based on social network analysis tools, pattern recognition methods and manifold learning approaches. ICARUS v2.0 currently supports CellChat v1.4.0 incorporating signalling molecule interaction databases for human and mouse. The option to analyse a single dataset or comparison between two datasets is available.
MAGMA: cell type GWAS trait associations. Genome wide association studies have identified several loci and genes that are associated with a trait of interest. The expression profiles of cell populations in single cell RNA-seq data can be compared against these GWAS loci to identify potentially affected cell types underlying complex traits. ICARUS employs the multi-marker analysis of genomic annotation (MAGMA) methodology to identify increased linear association between cell type derived gene sets and genelevel GWAS summary statistics, with the hypothesis that in affected cell types, expressed genes should be more associated with the GWAS trait (11,12). ICARUS v2.0 utilises the MAGMA.Celltyping v.2.0.6 R package to to identify cell types that may explain the heritability signal from GWAS summary statistics. The user may either upload their own GWAS summary statistics or select biological traits from over 700 public GWAS datasets from the IEU open GWAS project (https://gwas.mrcieu.ac.uk/), UK biobank (http://www.nealelab.is/uk-biobank) and FinnGen (https: //www.finngen.fi/fi). Public GWAS datasets were curated by neurogenomics/MAGMA Files Public (https://github. com/neurogenomics/MAGMA Files Public) and includes gene level GWAS statistics for 10b upstream and 1.5 kb downstream of each associated loci.
Drug-gene interaction database. The Drug-Gene Interaction Database (DGIdb, www.dgidb.org) (13,33,34) curates information on druggable genes from publications (35,36), known affected pathways (Gene Ontology, Human Protein Atlas, IDG) and publicly available databases (Drug-Bank, PharmGKB, Chembl, Drug Target Commons and TTD). ICARUS v2.0 offers an option to query differentially expressed genes against DGIdb v4.0 (13) and return a list of potential drug targets with information of interaction type (i.e. inhibitor, agonist, blocker), interaction claim source (i.e. PharmGKB, ChemblInteractions, etc), interaction group score (score takes into account number of drug and gene partners and number of supporting publications) and the relevant pubmed reference if available. Suitable drugs may be tested in a laboratory setting and contribute to discovery of repurposed drugs for clinical benefit.

RESULTS
To demonstrate the utility of ICARUS v2.0, we have examined a single-nuclei RNA-seq data of the substantia nigra from five human individuals with absence of neurological clinical disease. Data is available from human cell atlas (https://data.humancellatlas.org/explore/projects/ 996120f9-e84f-409f-a01e-732ab58ca8b9) (37). Using this dataset, a quality control filter was first applied removing low quality cells with unique gene counts less than  200 and cells with high mitochondrial reads > 5%. Dimensionality reduction with SCTransform was then performed, prioritising 3000 highly variable genes. Substantia nigra nuclei from the 5 individuals were integrated using harmony (38). The functionality of harmony integration was detailed in the original ICARUS publication (4). Cell clustering was performed with the first 25 dimensions, a k-nearest neighbour value of 20 and the Louvain community detection algorithm. Cluster labelling with sctype (17)  To showcase the gene network analysis capabilities of ICARUS v2.0, we focused on the oligodendrocyte precursor cell (OPC) cluster. MEGENA co-expression module analysis using 2,000 variable genes as input, highlighted a functional module involved in the regulation of extracellular organisation, synaptic signalling and developmental morphogenesis (gene module GO analysis, Supplementary Figure 1) that is centred around the LHFPL3-PTPRZ1-TNR gene cluster ( Figure 1C). Trajectory analysis selecting the OPCs cluster as the root cells, showcased a gradual change in pseudotime (changes in cell transcriptional state) starting from oligodendrocyte precursor cells and ending in microglia and astrocytes (Figure 2A). Top genes that change as a function of pseudotime included MEG3, DSCAM, LRRC4C (full list of genes can be found in supplementary table 1). The transition from OPC to astrocytes and microglia appears to be controlled by the cell differentiation and development related transcription factors, FOS, JUN and JUNB as determined by SCENIC transcription factor regulatory network analysis ( Figure 1E). Regulated gene module activity of the JUN family of transcription factors is detailed in Figure 1F. Cell-cell communication through ligand-receptor pairs showed a greater number of interactions between OPCs and astrocytes and fewer interactions with all other cell types ( Figure 2C). OPC outgoing signalling pathways included NGL, PTN, CNTN, APP and TENASCIN pathways and incoming signals from NRXN, NCAM, CNTN, PTN, NGL, PSAP and CALCR were detected ( Figure 2D and E).

DISCUSSION
One of the main reasons for single-cell RNA-seq is to aid in the generation of hypotheses that can be further validated by other experimental techniques. This update to our web server, ICARUS introduces extended tools for data interpretation with a focus on hypothesis generation. ICARUS v2.0 enables users to examine complex gene networks at the single cell level utilising well-established methodologies.
We have demonstrated the capabilities of ICARUS v2.0 on a publicly available dataset of the substantia nigra of five human individuals with absence of neurological disease (37). Analysis of the oligodendrocyte precursor cell cluster demonstrated the cellular pathway control of cell differentiation and synapse organisation that is likely centred around the co-expression of LHFPL3-PTPRZ1-TNR gene cluster. A clearly defined trajectory from OPCs to astrocytes and microglia mediated by FOS, JUN and JUNB transcription factors was observed that is further supported by increased number of cell-cell signalling from OPCs to astrocytes. Signalling pathways involved include NGL, CNTN and PTN. These observations further contribute to our understanding of the oligodendrocyte progenitor cell lineage and demonstrate the utility of this analysis (55). Another source of hypothesis generation can be achieved through investigation of potential GWAS trait associated cell populations. Using this dataset of human substantia nigra, we confirm the observation first described by Webber et al. (37) of a cell type association between oligodendrocytes, OPCs and GABAergic neuronal gene expression with genetic risk of Schizophrenia and bipolar disorder using the MAGMA methodology. This potentially points to a risk susceptibility for certain cell types underlying these diseases. Finally, the identification of drug-gene targets can facilitate targeted perturbations of key molecular pathways. In the context of dysregulation or disease, this could provide an avenue towards a therapeutic opportunity through repurposed drugs. We identified a possible drug target for the LHFPL3-PTPRZ1-TNR gene cluster expressed in OPCs with a signal transduction activator, Phorbol 12-myristate 13-acetate (39,40). Bioinformatics, 2023, Vol. 5, No. 2 7 The updates to ICARUS v2.0 described in this manuscript will provide users with unparalleled resolution into gene networks and understanding of gene regulation in relation to biological traits. The single cell RNA-seq research field is one of the fastest growing fields with over 4000 research articles published in 2021 alone (Scopus database). ICARUS will continue to see updates as new methodologies are developed to provide the user with a state-of-the-art resource for novel discoveries.

DATA AVAILABILITY
The functionality of ICARUS was demonstrated on a single nuclei RNA-seq dataset of the human substantia nigra from 5 individuals with absence of neurological disease. Data is available from human cell atlas (https://data.humancellatlas.org/explore/projects/ 996120f9-e84f-409f-a01e-732ab58ca8b9).

CODE AVAILABILITY
ICARUS is available at https://launch.icarus-scrnaseq. cloud.edu.au/. The application is free and open to all users with no login requirement.
R source code of the ICARUS v2.0 shiny app is available at doi.org/10.5281/zenodo.7735960. Alternatively, a docker version is accessible through the Docker Hub under the name 'icarusscrnaseq/icarus'.